Lecture 9: August 25th, 2023#

Reminders:

  • All EDA outcome quizzes have been posted. Attempt the ones you’re missing, and let me know (or come to student hours) if any issues come up. Anthony and I are here to help.

  • “50 years of data science” token-earning assignment due tonight at midnight. As always, this is optional.

  • I’m almost done writing the new homeworks for next week and they will be uploaded by tonight. They will be due Week 4 Friday at midnight instead of Wednesday.

Coming up:

  • On Monday, we’ll go through the instructions for the final project.

  • The planning worksheet for the final project will be due during Week 5.

Today:

  • We’ll introduce Machine Learning (ML)

  • We’ll start by coding for linear regression

  • Anthony will go through a worksheet on generating data for regression problems. Definitely go, if you are able to!

Introduction to Machine Learning#

Let’s take another field trip…to the iPad!

Performing Linear Regression Using scikit-learn#

import pandas as pd
import altair as alt
import seaborn as sns
  • Import the taxis data from Seaborn.

df = sns.load_dataset("taxis")
df.sample(5)
pickup dropoff passengers distance fare tip tolls total color payment pickup_zone dropoff_zone pickup_borough dropoff_borough
3807 2019-03-14 09:26:10 2019-03-14 10:03:06 1 4.90 25.0 2.00 0.0 30.30 yellow credit card Lincoln Square East East Village Manhattan Manhattan
4041 2019-03-12 18:42:51 2019-03-12 18:54:04 1 2.10 9.5 2.75 0.0 16.55 yellow credit card Midtown North Central Park Manhattan Manhattan
3546 2019-03-27 19:46:15 2019-03-27 20:03:56 6 3.29 14.0 2.50 0.0 20.80 yellow credit card Midtown Center Central Park Manhattan Manhattan
5937 2019-03-01 12:04:25 2019-03-01 12:05:24 1 0.15 7.0 1.40 0.0 8.40 green credit card Central Harlem North Central Harlem North Manhattan Manhattan
1531 2019-03-20 11:38:48 2019-03-20 11:48:46 1 0.70 7.5 2.15 0.0 12.95 yellow credit card Midtown Center Midtown North Manhattan Manhattan
  • Drop rows with missing values

df = df.dropna()
  • Using Altair, make a scatter plot with “fare” on the y-axis and with “distance” on the x-axis.

alt.Chart(df).mark_circle().encode(
    x="distance",
    y="fare"
)
---------------------------------------------------------------------------
MaxRowsError                              Traceback (most recent call last)
File ~/opt/miniconda3/envs/math9/lib/python3.9/site-packages/altair/vegalite/v5/api.py:2520, in Chart.to_dict(self, *args, **kwargs)
   2518     copy.data = core.InlineData(values=[{}])
   2519     return super(Chart, copy).to_dict(*args, **kwargs)
-> 2520 return super().to_dict(*args, **kwargs)

File ~/opt/miniconda3/envs/math9/lib/python3.9/site-packages/altair/vegalite/v5/api.py:838, in TopLevelMixin.to_dict(self, *args, **kwargs)
    836 copy = self.copy(deep=False)  # type: ignore[attr-defined]
    837 original_data = getattr(copy, "data", Undefined)
--> 838 copy.data = _prepare_data(original_data, context)
    840 if original_data is not Undefined:
    841     context["data"] = original_data

File ~/opt/miniconda3/envs/math9/lib/python3.9/site-packages/altair/vegalite/v5/api.py:100, in _prepare_data(data, context)
     98 # convert dataframes  or objects with __geo_interface__ to dict
     99 elif isinstance(data, pd.DataFrame) or hasattr(data, "__geo_interface__"):
--> 100     data = _pipe(data, data_transformers.get())
    102 # convert string input to a URLData
    103 elif isinstance(data, str):

File ~/opt/miniconda3/envs/math9/lib/python3.9/site-packages/toolz/functoolz.py:628, in pipe(data, *funcs)
    608 """ Pipe a value through a sequence of functions
    609 
    610 I.e. ``pipe(data, f, g, h)`` is equivalent to ``h(g(f(data)))``
   (...)
    625     thread_last
    626 """
    627 for func in funcs:
--> 628     data = func(data)
    629 return data

File ~/opt/miniconda3/envs/math9/lib/python3.9/site-packages/toolz/functoolz.py:304, in curry.__call__(self, *args, **kwargs)
    302 def __call__(self, *args, **kwargs):
    303     try:
--> 304         return self._partial(*args, **kwargs)
    305     except TypeError as exc:
    306         if self._should_curry(args, kwargs, exc):

File ~/opt/miniconda3/envs/math9/lib/python3.9/site-packages/altair/vegalite/data.py:19, in default_data_transformer(data, max_rows)
     17 @curried.curry
     18 def default_data_transformer(data, max_rows=5000):
---> 19     return curried.pipe(data, limit_rows(max_rows=max_rows), to_values)

File ~/opt/miniconda3/envs/math9/lib/python3.9/site-packages/toolz/functoolz.py:628, in pipe(data, *funcs)
    608 """ Pipe a value through a sequence of functions
    609 
    610 I.e. ``pipe(data, f, g, h)`` is equivalent to ``h(g(f(data)))``
   (...)
    625     thread_last
    626 """
    627 for func in funcs:
--> 628     data = func(data)
    629 return data

File ~/opt/miniconda3/envs/math9/lib/python3.9/site-packages/toolz/functoolz.py:304, in curry.__call__(self, *args, **kwargs)
    302 def __call__(self, *args, **kwargs):
    303     try:
--> 304         return self._partial(*args, **kwargs)
    305     except TypeError as exc:
    306         if self._should_curry(args, kwargs, exc):

File ~/opt/miniconda3/envs/math9/lib/python3.9/site-packages/altair/utils/data.py:82, in limit_rows(data, max_rows)
     80     values = data
     81 if max_rows is not None and len(values) > max_rows:
---> 82     raise MaxRowsError(
     83         "The number of rows in your dataset is greater "
     84         f"than the maximum allowed ({max_rows}).\n\n"
     85         "See https://altair-viz.github.io/user_guide/large_datasets.html "
     86         "for information on how to plot large datasets, "
     87         "including how to install third-party data management tools and, "
     88         "in the right circumstance, disable the restriction"
     89     )
     90 return data

MaxRowsError: The number of rows in your dataset is greater than the maximum allowed (5000).

See https://altair-viz.github.io/user_guide/large_datasets.html for information on how to plot large datasets, including how to install third-party data management tools and, in the right circumstance, disable the restriction
alt.Chart(...)

Here, we get a MaxRowsError: by default, Altair will not plot a dataset with more than 5000 rows.

  • Choose 5000 random rows to avoid the max_rows error.

Let’s get a random selection of 5000 rows from df. I’m not going to worry about getting reliable random rows; the point of this part is just to get a feel for what the data looks like.

alt.Chart(df.sample(5000)).mark_circle().encode(
    x="distance",
    y="fare"
)
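
Later in these notes the sample is made repeatable by passing random_state=10. Here is a minimal sketch of that idea on a made-up DataFrame (not the taxis data). The helper name cap_rows and the seed value are assumptions for illustration only.

```python
import pandas as pd

MAX_ROWS = 5000  # Altair's default row limit, as reported in the MaxRowsError above


def cap_rows(df, max_rows=MAX_ROWS, seed=0):
    """Return df unchanged if it is small enough, else a seeded random sample."""
    if len(df) <= max_rows:
        return df
    return df.sample(max_rows, random_state=seed)


# Made-up data standing in for the taxis DataFrame.
toy = pd.DataFrame({"distance": range(6000), "fare": range(6000)})

first = cap_rows(toy)
second = cap_rows(toy)

print(len(first))                        # number of rows kept
print(first.index.equals(second.index))  # same seed, so the same rows are chosen
```

Because the seed is fixed, rerunning the cell picks the same rows, so the chart does not change from run to run.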

Looking at the data, it seems to be roughly linear. It’s not perfectly linear, but we should be able to approximate a line pretty well. The only weird thing is that horizontal line…let’s see what’s going on there by adding a tooltip.

James brought up a great point: some of the rides go a distance of zero miles…and are still charged. Let’s remove these points from our data, because this seems very strange.

alt.Chart(df.sample(5000)).mark_circle().encode(
    x="distance",
    y="fare",
    tooltip=["dropoff_zone","pickup_zone","fare","distance"]
)
df2 = df.sample(5000,random_state=10)
df2 = df2[df2["distance"] > 0]
alt.Chart(df2).mark_circle().encode(
    x="distance",
    y="fare",
    tooltip=["dropoff_zone","pickup_zone","fare","distance"]
)

The points on the horizontal line all involve rides going to or from an airport. This looks like some kind of fixed-price promotion where you can go to the airport (or get picked up from the airport) and go anywhere within a region for a fixed price.

  • What would you estimate is the slope of the “line of best fit” for this data?

We have the points \((0.02,2.5)\) and \((5,16)\)

#The slope 
(16-2.5)/(5-0.02)
2.710843373493976

If I had to approximate the line, I’d say the slope is about 2.71.
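
The same two-point estimate can be wrapped in a small function, which also gives a rough guess for the intercept. This is only a sketch of the eyeball estimate above, not a fitted line. The function name line_through is made up for this example.

```python
def line_through(p, q):
    """Slope and intercept of the line through points p and q (x-values must differ)."""
    (x0, y0), (x1, y1) = p, q
    slope = (y1 - y0) / (x1 - x0)
    intercept = y0 - slope * x0
    return slope, intercept


# The two points read off the chart above.
slope, intercept = line_through((0.02, 2.5), (5, 16))
print(round(slope, 2), round(intercept, 2))
```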

There is a routine in scikit-learn that we will see many times! Starting now!

1.) Import
2.) Instantiate (create an instance of an object from an appropriate class)
3.) Fit
4.) Predict
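
Before using the taxi data, here is a minimal sketch of all four steps on a tiny made-up dataset whose points lie exactly on the line y = 2x + 1. The data and variable names are assumptions chosen for illustration.

```python
# 1.) Import
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data lying exactly on the line y = 2x + 1.
X = np.array([[1.0], [2.0], [3.0], [4.0]])  # inputs: two-dimensional, one column
y = np.array([3.0, 5.0, 7.0, 9.0])          # targets: one-dimensional

# 2.) Instantiate
reg = LinearRegression()

# 3.) Fit
reg.fit(X, y)

# 4.) Predict
pred = reg.predict(np.array([[5.0]]))

print(reg.coef_, reg.intercept_, pred)
```

Since the made-up points are exactly linear, the fitted slope and intercept should match 2 and 1 up to floating-point error.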

  • Find this slope using the LinearRegression class from scikit-learn.

#1.) import
from sklearn.linear_model import LinearRegression

Create a LinearRegression object and name it reg (for regression).

#2.) Instantiate
reg = LinearRegression()
type(reg)

We see that reg is a LinearRegression object. This type is not from base Python; it belongs to scikit-learn.

Below, let’s try to fit the data. We’re going to get an error, and I can say that you will most likely run into this error many times on your own.

#3.) Fit
reg.fit(df2["distance"],df2["fare"])

What goes wrong here is that reg.fit expects a two-dimensional array for the input, but we passed the pandas Series df2["distance"]. We should think of a pandas Series as a one-dimensional object.

df2["distance"].shape
(4972,)

Notice the blank after the comma in the shape. This tells us that the pandas Series is one-dimensional.

Observe the difference with the following:

df2[["distance"]]
distance
2871 2.80
898 1.20
845 2.10
1580 3.35
4002 10.70
... ...
1812 1.20
2191 13.11
4827 2.68
4326 1.60
5779 1.47

4972 rows × 1 columns

df2[["distance"]].shape
(4972, 1)

The example above is treated as a DataFrame with just one column. This is what happens when we index with a list of column names, as in df[[...]].

One way to remember when we need two dimensions versus one dimension is the use of capital letters in the fit(X, y) signature. The capital “X” means that we need two dimensions, while the lower-case “y” means we need a single dimension.
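
Here is a small sketch of the one-dimensional versus two-dimensional difference, using a made-up DataFrame named toy in place of df2. With scikit-learn installed, passing the Series as the input raises a ValueError, while the one-column DataFrame is accepted.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data standing in for df2.
toy = pd.DataFrame({"distance": [1.0, 2.0, 3.0, 4.0], "fare": [5.0, 8.0, 10.5, 13.0]})

X_series = toy["distance"]    # pandas Series: one-dimensional
X_frame = toy[["distance"]]   # one-column DataFrame: two-dimensional
y = toy["fare"]

reg = LinearRegression()

try:
    reg.fit(X_series, y)      # one-dimensional input is rejected
    raised = False
except ValueError as err:
    raised = True
    print("fit rejected the Series:", type(err).__name__)

reg.fit(X_frame, y)           # two-dimensional input works
print(X_series.shape, X_frame.shape)
```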

reg.fit(df2[["distance"]],df2["fare"])

At this point, reg has done all of the hard work of finding a linear equation that approximates our data (“fare” as a linear function of “distance”.)

Recall: The original question was asking us to find the slope. Here’s how we can get it:

The slope is stored as the coef_ attribute.

reg.coef_

Notice that this is a NumPy array. If I wanted to extract just the number, I could do this:

reg.coef_[0]

Earlier we estimated that the slope would be about 2.71, and the fitted slope is about 2.73, so I think we did a pretty good job :)

  • Find the intercept.

The intercept is stored as the intercept_ attribute.

reg.intercept_

Putting these together, the equation of our line is given by:

\[ \text{fare} \approx 2.7284866819996245*(\text{distance}) + 4.660714229453321 \]
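
As a quick sanity check, we can evaluate this line by hand at a distance that appears in the data. The sketch below copies the two numbers from the equation above and compares the result with the 2.8-mile ride in df2 (row 2871), which has a recorded fare of 12.0. The function name predicted_fare is made up for this example.

```python
# Coefficients copied from the fitted line above.
slope = 2.7284866819996245
intercept = 4.660714229453321


def predicted_fare(distance):
    """Evaluate the fitted line at a given distance (in miles)."""
    return slope * distance + intercept


actual = 12.0                    # fare recorded for the 2.8-mile ride (row 2871)
predicted = predicted_fare(2.8)
print(round(predicted, 2), round(actual - predicted, 2))
```

The prediction lands a little above the recorded fare, which is what we expect from a line that approximates the data rather than passing through every point.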

Good question from the chat: Why does reg.intercept_ not give you an array?

Answer: It has to do with the form of the model. In our case, we trained on just one input, distance, so the model looks like the equation above. But we don’t have to consider distance by itself; we could also train on distance, number of people, and the hour of the taxi ride. If we train on these three variables, we get 3 distinct coefficients, one per input, and they are returned together in a NumPy array. There is still only one intercept, so it is returned as a single number.

\[ \text{fare} \approx c_0*(\text{distance}) + c_1*(\text{number of people}) + c_2*(\text{time}) + \text{intercept} \]
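
Here is a sketch of the three-input case on made-up data (random numbers, not the taxi columns), just to show the shapes involved. The "true" coefficients and intercept are arbitrary values chosen for the demo.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Made-up inputs: 50 rows, 3 columns standing in for distance, number of people, time.
X = rng.normal(size=(50, 3))
true_coefs = np.array([2.5, 0.5, -0.1])   # arbitrary values chosen for the demo
true_intercept = 4.0
y = X @ true_coefs + true_intercept        # exactly linear, no noise

reg = LinearRegression().fit(X, y)

print(reg.coef_.shape)          # one coefficient per input column
print(np.ndim(reg.intercept_))  # number of dimensions of the intercept
```

The coefficients come back as an array with one entry per column, while the intercept is zero-dimensional, that is, a single number.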
  • What are the predicted outputs for the first 5 rows? What are the actual outputs?

df2[:5]
pickup dropoff passengers distance fare tip tolls total color payment pickup_zone dropoff_zone pickup_borough dropoff_borough
2871 2019-03-12 20:28:02 2019-03-12 20:43:16 1 2.80 12.0 3.15 0.00 18.95 yellow credit card Upper East Side South East Village Manhattan Manhattan
898 2019-03-24 13:17:38 2019-03-24 13:31:41 1 1.20 10.0 2.65 0.00 15.95 yellow credit card Murray Hill Clinton East Manhattan Manhattan
845 2019-03-04 13:22:23 2019-03-04 13:38:07 1 2.10 11.5 2.96 0.00 17.76 yellow credit card Midtown East Upper West Side South Manhattan Manhattan
1580 2019-03-21 23:31:03 2019-03-21 23:42:56 1 3.35 12.0 3.16 0.00 18.96 yellow credit card Kips Bay Lincoln Square East Manhattan Manhattan
4002 2019-03-16 08:55:35 2019-03-16 09:37:31 3 10.70 39.0 9.10 5.76 54.66 yellow credit card Manhattan Valley LaGuardia Airport Manhattan Queens

Notice, we have a distance of 2.8 and a fare of 12. The model will predict the following for a distance of 2.8:

reg.coef_*2.8 + reg.intercept_
reg.predict(df2[:5][["distance"]])

reg.fit is still a little mysterious, but reg.predict is not: it just evaluates our linear function at the given distances.
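
To take a little of the mystery out of fitting: with a single input, the least-squares line has a well-known closed form. The sketch below uses that formula in plain Python on made-up points. LinearRegression also performs ordinary least squares, but this is not its internal code, and the names fit_line and predict are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for a single input variable."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


def predict(slope, intercept, xs):
    """Evaluate the fitted line at each input value."""
    return [slope * x + intercept for x in xs]


# Made-up points lying exactly on y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept, predict(slope, intercept, [2.8]))
```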

Interpreting Linear Regression Coefficients#

  • Add a new column to the DataFrame, called “hour”, which contains the hour at which the pickup occurred.

df2.columns
Index(['pickup', 'dropoff', 'passengers', 'distance', 'fare', 'tip', 'tolls',
       'total', 'color', 'payment', 'pickup_zone', 'dropoff_zone',
       'pickup_borough', 'dropoff_borough'],
      dtype='object')
df2.dtypes
pickup             datetime64[ns]
dropoff            datetime64[ns]
passengers                  int64
distance                  float64
fare                      float64
tip                       float64
tolls                     float64
total                     float64
color                      object
payment                    object
pickup_zone                object
dropoff_zone               object
pickup_borough             object
dropoff_borough            object
dtype: object
df2["hour"] = df2["pickup"].dt.hour
df2.head()
pickup dropoff passengers distance fare tip tolls total color payment pickup_zone dropoff_zone pickup_borough dropoff_borough hour
2871 2019-03-12 20:28:02 2019-03-12 20:43:16 1 2.80 12.0 3.15 0.00 18.95 yellow credit card Upper East Side South East Village Manhattan Manhattan 20
898 2019-03-24 13:17:38 2019-03-24 13:31:41 1 1.20 10.0 2.65 0.00 15.95 yellow credit card Murray Hill Clinton East Manhattan Manhattan 13
845 2019-03-04 13:22:23 2019-03-04 13:38:07 1 2.10 11.5 2.96 0.00 17.76 yellow credit card Midtown East Upper West Side South Manhattan Manhattan 13
1580 2019-03-21 23:31:03 2019-03-21 23:42:56 1 3.35 12.0 3.16 0.00 18.96 yellow credit card Kips Bay Lincoln Square East Manhattan Manhattan 23
4002 2019-03-16 08:55:35 2019-03-16 09:37:31 3 10.70 39.0 9.10 5.76 54.66 yellow credit card Manhattan Valley LaGuardia Airport Manhattan Queens 8
  • Remove all rows from the DataFrame where the hour is 16 or earlier. (So we are only using late afternoon and evening taxi rides.)

That’s all we got to today! We’ll pick back up on Monday.

  • Add a new column to the DataFrame, called “duration”, which contains the amount of time in minutes of the taxi ride.

Hint 1. Because the “dropoff” and “pickup” columns are already date-time values, we can subtract one from the other and pandas will know what to do.

Hint 2. I expected there to be a minutes attribute (after using the dt accessor) but there wasn’t. Call dir to see some options.
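
Here is one possible sketch of the two hints together, using two pickup/dropoff pairs copied from the df2 rows shown earlier. It assumes that converting with total_seconds and dividing by 60 is an acceptable way to get minutes; other options from dir may work just as well.

```python
import pandas as pd

# Two pickup/dropoff pairs copied from the df2 rows shown earlier.
toy = pd.DataFrame({
    "pickup": pd.to_datetime(["2019-03-12 20:28:02", "2019-03-24 13:17:38"]),
    "dropoff": pd.to_datetime(["2019-03-12 20:43:16", "2019-03-24 13:31:41"]),
})

elapsed = toy["dropoff"] - toy["pickup"]   # Series of time differences

# Hint 2: list what the dt accessor offers for time differences.
timedelta_options = [name for name in dir(elapsed.dt) if not name.startswith("_")]
print(timedelta_options)

toy["duration"] = elapsed.dt.total_seconds() / 60   # minutes, as floats
print(toy["duration"].round(2).tolist())
```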

  • Fit a new LinearRegression object, this time using “distance”, “hour”, “passengers” as the input features, and using “duration” as the target value.
